Chromosomal-level assembly of Tokudaia osimensis, Tokudaia tokunoshimensis, and Tokudaia muenninki genomes

Herein, we present the first high-quality long-read-based chromosome-level genome assemblies and gene annotations of the genomes of three endangered Tokudaia species: Tokudaia osimensis, Tokudaia tokunoshimensis, and Tokudaia muenninki. These species, which are endemic to different islands of the Ryukyu Islands, Japan, exhibited unique karyotypes and sex chromosomal characteristics. The genome assemblies generated using PacBio, Illumina, and Hi-C sequence data consisted of 13 (corresponded to 12 autosomes and one X chromosome), 23 (corresponded to 22 autosomes and one X chromosome), and 23 (corresponded to 21 autosomes and the neo- and ancestral X regions) chromosome-level scaffolds that contained 2,445, 2,477, and 2,661 Mbp of sequence data, respectively. Annotations of protein-coding genes were performed using RNA-Seq-based, homology-based, and Ab initio methods. BUSCO completeness values for every species exceeded 96% for genomes and 98% for genes. These data can be an important resource for contributing to our understanding of species genomes resulting from allopatric speciation and provide insights into mammalian sex-determination mechanisms and sex chromosome evolution.


Background & Summary
Tokudaia is a genus of murine rodents consisting of only three species of spiny rat, Tokudaia osimensis, Tokudaia tokunoshimensis, and Tokudaia muenninki, all of which have been listed as endangered because of their declining populations in recent years 1 .The three Tokudaia species are endemic to Amami-Oshima, Tokunoshima, and Okinawa Islands of the Ryukyu Islands, Japan.These islands were probably formed by repetitive sea regressions and transgressions among the Central Ryukyu Islands because of sea-level changes caused by glacial-interglacial cycles since the early Pleistocene 2 .Phylogenetic relationships between the three Tokudaia species correspond to the geographical distances between the three islands 3 .The three spiny rat species exhibit different karyotypes and unique sex chromosome characteristics.For example, the diploid number of chromosomes (2n) for both sexes in T. osimensis is 25, in T. tokunoshimensis is 45, and in T. muenninki is 44 [4][5][6] .Moreover, both T. osimensis and T. tokunoshimensis have lost their Y chromosomes and Sry, a sex-determining gene in mammals 7 .Therefore, both males and females of these two species exhibit the XO/XO karyotype, with only one X chromosome; 4,5 these species probably exhibit an Sry-independent sex-determination mechanism, different from that of common XX/XY-type mammals (Fig. 1).Various studies have been conducted to understand the sex-determination mechanism of these spiny rats, including experimental [8][9][10] and genome sequence-based approaches.The results revealed that genes on the Y chromosome, other than Sry, have been conserved in the genome by translocation to the X chromosome 11,12 .Additionally, results showed a male-specific copy number variation (CNV) region upstream of Sox9 on autosomal chromosomes, including the Sox9 enhancer 14, which is involved in regulating Sox9 and Foxl2 expression, which are related to testicular differentiation 13 .
In contrast, T. muenninki exhibits an XX/XY sex chromosome configuration, similar to that of typical mammals; however, it contains a large neo-sex chromosome with autosomes, fused to the X and Y chromosomes 14,15 .A study showed that genes near the ancestral X and Y regions of the neo-X and neo-Y chromosomes accumulated male-specific mutations 15 .Unlike T. osimensis and T. tokunoshimensis, Sry has not been lost in T. muenninki and has multiple copies on the Y chromosome 3,16 .However, reportedly, Sry in T. muenninki is suspected to have lost its function as a sex-determining gene 9,17 .These three spiny rat species are suitable research subjects; however, their genome-based research has been limited to some RNA sequencing (RNA-Seq)-based 15,18 , fluorescence in situ hybridization-based 14,19 , and limited numbers of bacterial artificial chromosomes (BACs)-based 16 studies, except for the T. osimensis genome, which has been studied by short-read sequencing 13 .
Herein, we provide the first report of high-quality, long-read-based chromosome-level genome assemblies and gene annotations for the three Tokudaia species.Our results can provide a valuable foundation for future studies regarding mammalian sex-determination mechanisms and sex chromosome evolution.Moreover, the constructed dataset may provide a valuable resource for the direct comparison of the genomes of species formed because of allopatric speciation over the last several million years.

Methods
Sample collection and sequencing.Whole-genome shotgun sequencing was performed using the PacBio and Illumina sequencing platforms.The genomic DNAs of T. osimensis, T. tokunoshimensis, and T. muenninki were isolated from the muscle tissues of male specimens that had accidentally perished (recovered in December 2005, December 2011, and March 2013, respectively) and had been stored in a deep freezer using the smart DNA prep Kit (Anlytik Jena, Jena, Germany).Genomic DNA from T. osimensis and T. muenninki was sheared into 30-100 kb fragments using a g-tube device (Covaris Inc., MA, USA).Although the DNA of T. tokunoshimensis was slightly degraded, this procedure was omitted.Subsequently, a continuous long-read (CLR) single-molecule real-time (SMRT) bell library was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Bioscience, CA, USA) according to the manufacturer's instructions.Three CLR libraries were size-selected using the BluePippin system (Saga Science, MA, USA) with a lower cutoff of 40 kb.For each species, the library was run on two SMRT Cell 8Ms with Binding Kit 2.0 and Sequencing Kit 2.0.Through PacBio sequencing, 254.2 Gb, 272.1 Gb, and 271.8 Gb of CLRs were obtained for T. osimensis, T. tokunoshimensis, and T. muenninki, respectively (Table 1).Additionally, genomic DNA was fragmented to an average size of 500-600 bp Fig. 1 Habitat and phylogenetic relationships of three spiny rats.Karyotype information and evolutionary events assumed to have occurred in the ancestral species are shown.This map is based on a GSImap 74 published by the Geospatial Information Authority of Japan.Photo courtesy: Kimiyuki Tsuchiya.
using Focused-ultrasonicator M220 (Covaris Inc., MA. USA).Paired-end libraries with 450-550 bp insert sizes were constructed using the TruSeq DNA PCR-Free Library Prep kit (Illumina, CA, USA) and size-selected on an agarose gel using the Zymoclean Large Fragment DNA Recovery Kit (Zymo Research, CA.USA).The final libraries were sequenced following the 2 × 150 bp paired-end protocol for HiSeq 2500 (T.osimensis) and NovaSeq 6000 (T.tokunoshimensis and T. muenninki) systems (Illumina, San Diego, CA, USA).After Illumina sequencing, we obtained 448.1 Gb, 267.2 Gb, and 334.3 Gb of paired-end reads for T. osimensis, T. tokunoshimensis, and T. muenninki, respectively (Table 1).The Omni-C library was prepared using the Dovetail Omni-C Kit (Dovetail Genomics, Scott Valley, CA, USA) according to the manufacturer's protocol.1 × 10 6 cultured cells were collected from the same T. osimensis individual from which the genomic DNA was extracted.The pulverized sample was processed into a proximity ligation library using the Omni-C Proximity Ligation Assay Mammalian Samples Protocol of the Omni-C Kit.The final library was sequenced using NovaSeq 6000 (Illumina) with a 2 × 150 bp read length, and 145.3 Gb of Omni-C reads were generated for T. osimensis (Table 1).The Arima Hi-C libraries were constructed from the same tissue (98.1 mg) of T. tokunoshimensis and cerebrum (177.4 mg) of the same T. muenninki individual from which genomic DNAs were extracted, using the Arima-HiC + Kit (Arima Genomics, CA, USA) according to the manufacturer's instructions regarding Animal Tissues (A160132 v01) and library preparation (A160137 v00) using the TruSeq DNA PCR-Free Library Prep kit (Illumina, CA, USA).The obtained Hi-C libraries were run on the Illumina NovaSeq 6000 system with a 2 × 150 bp read length, and 153.8 Gb and 164.0 Gb of Arima Hi-C reads were generated for T. tokunoshimensis and T. muenninki, respectively (Table 1).

Genome size and heterozygosity estimation. The genome size and heterozygosity of the three
Tokudaia species were estimated using the Illumina sequencing data and a k-mer-based method.The sequenced reads from Illumina were filtered using Platanus_trim v1.0.7 (http://platanus.bio.titech.ac.jp/pltanus_trim) with default parameters.Using the trimmed reads, Jellyfish v2.3.0 20 was applied to extract and count the canonical k-mers at k = 32.Following this, GenomeScope 2.0 21 was used to estimate the haploid genome sizes and heterozygosities based on k-mer count data with parameters of "-k 32 -p 2" (Fig. 2).Thus, we estimated a haploid genome size of 2,472.5 Mbp with 0.53% heterozygosity for T. osimensis, 2,472.6Mbp with 0.25% heterozygosity for T. tokunoshimensis, and 3,362.6Mbp with 0.45% heterozygosity for T. muenninki (Fig. 2).The genome size estimation result that the T. muenninki has a genome size approximately 1 Gb larger than that of the other two species is consistent with a previous study showing that T. muenninki contains large heterochromatic blocks 14 .
We performed Hi-C scaffolding using Hi-C datasets to obtain chromosome-level assemblies of the three Tokudaia species.The examination of input-purged assembly contigs before Hi-C scaffolding revealed the presence of sex chromosome-derived contigs that appeared to have been incorrectly removed by Purge_Dups.Therefore, Hi-C scaffolding was performed after reverting the sex chromosome-derived contigs, which were determined to have been erroneously deleted using the Hi-C read contact information.The Omni-C/Arima Hi-C reads were then mapped to the input contigs and processed to generate Hi-C contacts using Juicer v1.6 26 .Thereafter, the Hi-C contacts files were used for scaffolding by 3D-DNA v180922 27 with the following parameter settings: "-editor-coarse-resolution 50000-editor-coarse-region 250000-input 1000".We visualized the Hi-C contact map using Juicebox v1.11.08 28 , checked the mapping information of sequence reads on Integrated Genomics Viewer (IGV) 29 , and performed extensive manual curation to fix mis-assemblies, mis-scaffoldings, and mis-foldings caused by duplicated regions.Additionally, the mitochondrial genome sequences were constructed from the Illumina short reads of each individual using GetOrganelle v1.7.5.0 30 .All constructed mitochondrial sequences were circularized, and the obtained sequence lengths were 16   2).The length of constructed genomes of T. osimensis and T. tokunoshimensis were near their estimated genome sizes, 2,472.5 Mbp and 2,472.6Mbp, respectively, whereas that of T. muenninki was only approximately 702 Mbp less than its estimated genome size of 3,362.6Mbp.It was assumed to be because of the inability to construct the sequences of the heterochromatin region, which are highly repetitive.In T. osimensis, 99.79% of the assembled contigs were incorporated into 13 scaffolds, which corresponded to 12 autosomes and one X chromosome, suggesting that all 2n = 25 chromosomal karyotypes were constructed.Similarly, in T. tokunoshimensis, 99.87% of the assembled contigs were incorporated into 23 scaffolds, which corresponded to 22 autosomes and one X chromosome, suggesting that all 2n = 45 chromosomal karyotypes were constructed.Conversely, in T. muenninki, 91.31% of the assembled contigs were incorporated into 23 scaffolds, which were considered to correspond to 21 autosomes and the neo-and ancestral X regions in T. muenninki, with 2n = 44.For the ancestral Y region, the orientation and order of the candidate contigs could not be determined through Hi-C scaffolding, and chromosome-level scaffolds were not constructed.Therefore, sequences derived from the ancestral Y region The estimated haploid genome sizes for T. osimensis, T. tokunoshimensis, and T. muenninki were 2,472.5 Mbp with 0.53% heterozygosity, 2,472.6Mbp with 0.25% heterozygosity, and 3,362.6Mbp with 0.45% heterozygosity, respectively.T. muenninki, with a genome size approximately 1 Gb larger than that of the other two species, shows consistency with a previous study demonstrating the presence of large heterochromatic blocks.
were constructed as unplaced scaffolds.The unplaced scaffold accounted for 8.69% (231.3Mbp) of the entire genome sequence, suggesting that it contained sequences of the heterochromatin region in addition to those of the ancestral Y region.
The chromosome numbers and orientations of the scaffolds were determined based on the synteny relationship examined among the three constructed chromosome-scale scaffolds and mouse chromosomes by extracting the best bidirectional alignments using minimap2 31 and corresponding them to the results of previous chromosome painting studies 14,19 .Figure 3 illustrates the Hi-C contact map constructed using the rearranged sequences.In all three species, the X chromosome had half of the coverage; therefore, the contacts were weak, whereas in T. muenninki, the fragmented Y chromosome-derived sequences were clustered in the lower-right corner of the contact map as unplaced scaffolds.
Gene structure prediction and functional annotation.Gene structures were predicted using the following methods: (1) an RNA-Seq-based method that predicts gene structures on the basis of transcriptome sequencing results, (2) a homology-based method that predicts gene structures on the basis of protein sequences of related species, and (3) an Ab initio method that learns the features of gene regions and predicts them directly from the genome sequence.The prediction results of the three methods were integrated using an integration tool.The three prediction methods are described below.
The results predicted by the three methods were integrated into the GINGER pipeline 52 .Additionally, predicted genes that satisfied all the requirements of the confidential M. musculus 42 or R. rattus 46 full-length homology-based results were extracted and integrated into the GINGER results.The requirements were as follows: Compared with mouse or rat orthologous genes, predicted genes should have (a) 80% or more identity and sequence coverage, (b) less than 3-amino acid length differences with all exons, and (c) all introns should be 0.5 to 2.0 times the length of the corresponding introns of the mouse or rat.If an exon overlapped between the newly predicted exons and the GINGER exon after integration, the exon with a longer CDS was used for the final result.
The statistics for the predicted gene sets are listed in Table 3.A total of 22,233, 22,106, and 22,240 protein-coding genes were predicted in T. osimensis, T. tokunoshimensis, and T. muenninki, respectively.These numbers were similar to those annotated in the closely related species mouse (22,173) and rat (22,215).Other statistics were similar to those for the mice and rats.Subsequently, the homology information of the mouse gene was assigned to the predicted genes as functional annotation.A homology search was performed using BLASTP v2.11.0+ 25 against the mouse RefSeq sequence 42 , and if a hit with more than 80% identity and coverage was found, the mouse annotation information was assigned to the corresponding predicted gene.Consequently, 18,469, 18,354, and 18,471 genes, which corresponded to approximately 83% of the total genes, were annotated for T. osimensis, T. tokunoshimensis, and T. muenninki, respectively.

technical Validation
DNA and RNA quality.Agarose gel electrophoresis was used to assess the quality of the extracted DNA.
The main band was >20 kb, and the DNA spectrophotometer ratio (260 nm/280 nm) was >1.8.The quality of the purified RNA molecules was examined by 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), and RNA integrity (RIN) was >7.0.assembly evaluation.The quality and completeness of the chromosome assembly were evaluated using two independent approaches.First, the QV value and completeness were estimated using Merqury v1.3 72 by comparing k-mers in the assembly with those found in the Illumina sequence reads.The results revealed that QV values of the assembly were 45.21, 47.85, and 45.32 for T. osimensis, T. tokunoshimensis, and T. muenninki, whereas completeness values were 93.71%, 93.72%, and 93.14%, respectively.Second, the completeness of the assembly was assessed using BUSCO v5.4.7 73 (genome mode) with 13,798 single-copy orthologs from the glires_odb10 database.The analysis revealed that 96.15%, 96.14%, and 96.30% of the complete BUSCOs were identified in the genomes of T. osimensis, T. tokunoshimensis, and T. muenninki, respectively.

Gene annotation evaluation.
The completeness of the annotated protein-coding genes was assessed using BUSCO v5.4.7 73 (protein mode) with 13,798 single-copy orthologs from the glires_odb10 database.The analysis revealed that 98.70%, 98.80%, and 99.01% of the complete BUSCOs were identified in the annotated genes of T. osimensis, T. tokunoshimensis, and T. muenninki, respectively.

Fig. 2 K
Fig. 2 K-mer analysis of three Tokudaia species genomes.(a) T. osimensis (b) T. tokunoshimensis (c) T. muenninki.The estimated haploid genome sizes for T. osimensis, T. tokunoshimensis, and T. muenninki were 2,472.5 Mbp with 0.53% heterozygosity, 2,472.6Mbp with 0.25% heterozygosity, and 3,362.6Mbp with 0.45% heterozygosity, respectively.T. muenninki, with a genome size approximately 1 Gb larger than that of the other two species, shows consistency with a previous study demonstrating the presence of large heterochromatic blocks.

Fig. 3
Fig. 3 Genome-wide Hi-C contact map of three Tokudaia species genomes.The blue squares represent chromosomes.(a) Hi-C contact map of T. osimensis comprising 13 chromosome-level scaffolds (representing 12 autosomes and the X chromosome) (b) Hi-C contact map of T. tokunoshimensis comprising 23 chromosomelevel scaffolds (representing 22 autosomes and the X chromosome) (c) Hi-C contact map of T. muenninki comprising 23 chromosome-level scaffolds (representing 21 autosomes and the neo-and ancestral X regions).

Table 2 .
Assembly statistics of three Tokudaia species genomes.

Table 3 .
Gene annotation statistics of three Tokudaia species genomes.